perm filename HYRS.RFS[UP,DOC] blob
sn#002762 filedate 1972-10-06 generic text, type T, neo UTF8
PRODUCTION INTERPRETER AND SCANNER.
This write-up describes briefly a group of programs which can be used
to compile production tables from a reasonably clear list of
productions, and then play with various input sentences. This is the
same production compiler and interpreter that is used by the SAIL
compiler. We make brief mention in the writeup of the method for
attaching semantic routines to your syntax analyzer.
I. Production Compiler
The production compiler expects an input file of a very special
format. The ultimate aim of this input file is to specify a sequence
of productions, but we must first specify the production alphabet,
both terminal and non-terminal.
The meta language for specifying productions consumes a few symbols
and conventions. First, all alphabetic characters in the input file
must be in upper case. The only delimiters are space and tab, so →AG
does NOT get interpreted as two separate symbols.
1. For various undisclosed reasons, we must first provide
alternate "names" for all single-letter delimiters we may need, such
as ( ) } ↑ [ ], etc. The pseudo-op <SYMBOLS> is given, followed by
pairs of
single-letter-delimiter crazy-alternate-name
See the example below for some instances of this phenomenon. The
crazier the name the better. I suggest that each of these names
begin with the character $, to avoid duplication elsewhere.
2. The terminal symbols of the language are then specfied in two
groups: first, those which the scanner knows about directly, and
second, the group of reserved words in the language (these look like
identifiers to the scanner, but you obviously desire a special
interpretation). The first group is initiated with the pseudo-op
<TERMINALS>, and must include all the single-letter-delimiters cited
in 1, plus three other symbols WHICH MUST ALWAYS BE EXACTLY AS
FOLLOWS:
IDENT NUMBR STRIN
These are the "names" that the scanner will return for (respectively)
identifiers that are not reserved words, numbers, and string
constants. The second group is started with <RESERVED-WORDS>, and is
merely a list of all the reserved words you desire.
3. The non-terminal symbols are then declared. These are things
which you may want to put on the stack as "markers" of partially
completed reductions. The pseudo-op <NON-TERMINAL-SYMBOLS> is
followed by a list of such symbols.
4. For efficiency reasons, it is helpful to define CLASSES of
symbols. Then with one production, we can determine if a whole class
of productions are applicable, and we can often avoid stating all the
productions. To specify classes, we start with the pseudo-op
<CLASSES>, and then, on a one-class-to-a-line basis, we specif%y:
class-name class-element class-element ....
where class-name must begin with a @. All of the class-elements must
have already been defined in 2 or 3 above, and hence may not be
classe-names.
5. Finally we are ready to state the productions. We give the
pseudo-op <PRODUCTIONS> to start this off. The very first production
listed will be the starting point for the production interpreter.
The interpreter will start AFTER ONE SCAN, so that the top of the
stack will already have the first "parse token" on it.
The syntax of each production is:
label: LHS →→ RHS EXEC xxx SCAN α ¬yyy #zzz ↓↓ ↑www
This specifies a production. These symbols need at least some
explanation:
-- the label is the "name" of the production. It is optional. All
labels in your program must begin with the character % and
must be unique in their first 6 characters. There may not be
a space between the last character of the label and the : .
-- LHS is a left-hand-side-list. That is, a list of symbols declared
in 2,3, or 4 above which is to be matched against the current
top elements of the stack.
-- RHS is a right-hand-side-list. This is the list of symbols which
is to replace the LHS on the stack if the production
succeeds.
-- EXEC specifies that a list of execs is to be called. The names
that you give in this list of exec routines should be the
names of the procedures that you make up in your exec routine
file.
-- SCAN specifies that we are to scan before going on to the next
production.
-- the # ¬ ↑ and ↓↓ parts specify where "control" of the production
interpreter is to go.
The following things are omittable:
-- the label
-- the →→ and right-hand-side-list, in which case it is assumed that
if the production succeeds, the stack is restored to its
original condition.
-- the right-hand-side-list, in which case nothing is restored to the
stack.
-- EXEC and the list of exec routines which follows it.
-- SCAN if you do not want to scan. Note that SCAN α means scan α
times before going to the success spot.
-- the #zzz may be omitted.
-- the ↓↓ may be omitted.
__ the ↑www may be omitted.
Now for the interpretation. When the interpreter is pointed at this
production, the stack is compared against the left-hand-side-list.
The last element in this list is compared against the current top of
the stack, and so on back the list and up the stack. Compares are
equal if:
-- the symbols actually match.
-- the list element in the left-hand-side is SG, which stands for
sigma, and compares equal to anything. SG is thus a meta
symbol of the production compiler, and may not be used in any
other way.
-- the list element in the left-hand-side is a class symbol and the
corresponding stack element belongs to that class. If the
left-hand-side-list fails to match the top positions of the
stack, the production interpreter sees a failure and goes to
another production. The next production is normally the one
following the one which failed. If you want to specify some
other failure production, the #zzz construct is used, where
zzz is the label of the production you want to go to.
If the left-hand-side-list matches the top of the stack, then:
1. These stack elements are popped off the stack temporarily.
2. If any exec routines have been specified, these are called. The
statement EXEC FOOBAZ causes the subroutine called FOOBAZ to
be called. The statement EXEC @DCL FOOBAZ causes the
following to happen: FOOBAZ is a routine which wants to know
some class information about one of the left-hand-side
elements. This element is specified by the @DCL (this was
the same thing that occurred on the left-hand-side). Just how
this class information is passed is not important here. Upon
return from all exec routines, we continue to:
3. The stack is then fixed back up, reflecting any changes as
specified in the right-hand-side-list.
4. The scanner is called if SCAN was specified.
5. The production has now "succeeded". We must cast around for the
next production. Each production must have some
specification of what to do on success. If you only
specified a ¬yyy , then we do a "jump" to the production with
label yyy. If you specified a ↑aaa and a ¬yyy , we do a "push
jump" to production aaa. When we return from that
"subroutine," we will go to production yyy. The "pop jump"
is specified by ↓↓. It makes no sense to say: ¬aaa ↓↓ since
these two operations conflict (in fact, the pop jump will
take precedence).
The list of productions must be followed by the pseudo-op <END>
II. Compiling your productions and execs.
If you have completed your file of productions as above, you are
ready to give it a whirl. Suppose you named that file MYPROD. Then
make another file called, say PRODGO, which is a text file that looks
exactly like this:
QQ.=MYPROD/NOLIST/NOLO/NON SYS:HYRS
QQ=HEAD[S,AIL]+PARSER[1,RFS]+QQ/FORWARD,/SAIL SCANR[1,RFS],EXECS
This will be a command file which starts the whole mess in motion.
You need a file containing exec routines stated in the production
list. If you merely want to play, and just want dummy exec routines,
allow me to suggest the following glorious SAIL macro:
DEFINE exec(x)="internal procedure x;
out(1,"" x ""&('15&'12));";
The scanner has the TTY inited on channel 1, so this will cause the
name of the exec routine to be printed out. The macro is used very
simply. If you asked for routines called RETRY,ERRG,STORE, just say
exec (RETRY)
exec (ERRG)
exec (STORE)
These definitions of exec routines should be in a file called EXECS
(or you may change the name in the command file, see above). This
file has an ENTRY statement at the beginning to avoid putting out a
starting address. The scanner is the guy who is "started," not the
exec routine package. There is a sample exec routine file in VI
below.
Now you are ready to try things. To get everything to compile and
load, type to the time-sharing system EXECUTE @PRODGO. This causes
the command file to be scanned and the world to whirr.
III. Scanner and debugger.
The scanner will ask you some stupid questions, like "filename for
scan table". If you used the command file format cited above, type
QQ followed by a carriage return. The scanner is now loading the
scan table and the reserved word symbol table. Then you will get
"input device" spewed out. It is asking where to get its input from.
Tell it, with TTY or DSK or something. If it requires a file name,
it will not hesitate to ask. If you give it a bum filename, it will
not hesitate to ask again.
Then the production interpreter is initialized, the scanner is called
once, and then the first production is interpreted. But to allow you
to watch the progress of the production analysis and stack twiddling,
we print out the production name (the label you gave the production),
and the top 3 entries on the stack. If there was no production name,
we take the nearest one (say RP0) and start making RP0A, RP0B, etc.
Then the onus is on you the user to tell the interpreter what is to
happen next. This is done by typing something at the interpreter:
P --proceed. Go on with what you were doing. When in doubt
about the fancy features below, just keep typing P.
B --set or remove a breakpoint on a production. Follow this
immediately by S to set a breakpoint, or R to remove one,
followed by the production name, followed by a space. The
production interpreter will stop whenever it begins
consideration of a production which has a breakpoint on it.
#M --set mode. The number preceeding the M should be:
1 -- print just exec routine calls.
2 -- print nothing
3 -- print everything (exec routines and prodcutions)
4 -- multiple proceed. If you type 4MP, the thing will just
scream along merrily showing you whatever you have asked for.
5 -- display only the line being scanned.
6 -- super-fast mode. Ignore the user altogether.
D --go to DDT or RAID.
#S --look at stack element of number #. 0 is the top of stack,
and positive numbers go back up the stack.
These features should allow you to monitor the progress of your
syntax analyzer, and see where you may need to change productions.
IV. Attaching semantic routines to your productions.
This section is not intended for those people who just want to play.
It will get rather involved with SAIL constructs, and is not at all
necessary for just playing around with syntax. Now that only the most
determined people are still reading, allow me to say that it is not
really very complicated.
If the production interpreter finds a successful production, it pops
off the left-hand-side elements of the stack into some temporaries in
core. It then makes up a copy, in core temporaries, of how the
right-hand-side will look when it is finally restored. There are two
stacks that are manipulated in all these operations, the parse stack
and the semantic stack. The operations on the parse and semantic
stacks are always identical.
Consider the following production:
ELHS E SG →→ ELHS T SG
When the interpreter pops off the left-hand-side elements, it
associates with each a number (distance from top of stack):
SG 0
E 1
ELHS 2
These numbers are the references to the semantic stack entries
associated with the parse tokens stated. As the temporary copy of
the right-hand-side is created in core, there are several rules:
-- if we are putting on a brand new parse token at position α (α is
the distance from the new top of stack), then the semantic
element from the αth location on the left-hand-side is copied
for the α location on the right-hand-side.
-- if we are putting on a parse token which occurs on the
left-hand-side at location β, then we copy location β's
semantic entry to the right-hand-side location α. If there is
more than one occurrence of the parse token on the
left-hand-side, we assume the one with the highest β.
So this information is sufficient to specify how the semantic entries
come and go. The purpose of the first rule above is to facilitate
such things as
T SG →→ E SG
promoting a term to an expression, and saving the semantics of the
term to associate with the expression.
Now for the routines which access the semantic stack entries. The
routine GETSEM (integer β) returns the integer entry in the
left-hand-side entry β. The routine PUTSEM (integer α; integer word)
deposits the semantics "word" into right-hand-side location α. The
SAIL declarations for these routines should be made in your program,
and should look like:
EXTERNAL INTEGER PROCEDURE GETSEM(INTEGER β);
EXTERNAL PROCEDURE PUTSEM(INTEGER α,word);
When the scanner scans an identifier, it records in the semantic
stack entry a zero. The string for the most recently scanned
identifier or string constant (same format as SAIL string constants )
is in the string IDNAME (which you must have declared EXTERNAL).
The string for the next most recently scanned identifier is in the
string OLDID. If the scanner saw a number, the decimal value of the
number is in SCNDEC, and the octal in SCNOCT. If there was an 8 or 9
in the input stream, SCNOCT is negative.
There is a facility for passing class-type information to exec
routines. The production interpreter stores a number in GENCLS (which
you must declare as an external integer) before calling the exec
routine. This number contains the type information you need. The
problem of interpreting that number is complicated but not
impossible. The elementary statement is: the number is the index
(starting at 0) into the line of symbols in which that class type is
declared. So if you declare the class type
@DCL INTEGER REAL BOOLEAN COMPLEX STRING
and if you wrote a production
%CON: @DCL IDENT , →→ @DCL EXEC ENTID SCAN 2 ¬%CON
the when ENTID is called, GENCLS will contain the following numbers:
parse token which compared
equal to @DCL
INTEGER 0
REAL 1
BOOLEAN 2
COMPLEX 3
STRING 4
This simple explanation breaks down when, say BOOLEAN was declared to
be a member of another class, and THAT CLASS WAS DECLARED BEFORE @DCL
WAS. So the easy rule is: declare the classes that will be required
to provide class information to semantic routines first, and then
those classes which are used only for parsing purposes.
V. Sample Production File:
<SYMBOLS>
; $SEMI - $MINUS + $PLUS * $TIMS / $DIV % $MOD ← $LARW
∧ $AND ⊗ $SHIFT : $COLW
( $LPRN ) $RPRN , $COMA
<TERMINALS>
: ( ) , ; - + * / % ← ∧ ⊗ IDENT NUMBR STRIN
<RESERVED-WORDS>
IF THEN ELSE BEGIN END GO LABEL INTEGER REAL
<NON-TERMINAL-SYMBOLS>
BLAT S T E P LHS SIFC EIFC
<CLASSES>
@PM + - ⊗ ∧
@TD * /
@DCL INTEGER REAL LABEL
<PRODUCTIONS>
MUMBLE IS THE WORD YOU USE TO MAKE COMMENTS IN THE PRODUCTIONS.
FOR OBVIOUS REASONS, WE CANNOT USE "COMMENT". THE
END OF THE COMMENT IS SIGNALLED BY A SEMICOLON ;
%PR1: BEGIN →→ BLAT BEGIN EXEC BEG1 SCAN 2 ¬%DS #%ERT
%DS: @DCL SG →→ SG EXEC @DCL DECL SCAN ¬%IDL #%S1
%IDL: IDENT , →→ EXEC ENTID SCAN 2 ¬%IDL
IDENT ; →→ EXEC ENTID ENDD SCAN 2 ¬%DS
%S1: BEGIN SG SCAN ¬%DS
IDENT ← →→ LHS EXEC TWID SCAN 2 ¬%EX1
IF SG →→ SIFC SG EXEC IFBEG SCAN ¬%EX1
GO IDENT →→ S EXEC TRA SCAN ¬%S9
IDENT : →→ EXEC ENTLAB SCAN 2 ¬%S1
%ERT: SG →→ EXEC ERRTR SCAN ¬%S1
MUMBLE THIS LAST IS THE ERROR MAN. HE MIGHT THROW IN THE TOWEL ;
%EX1: ( SG SCAN ¬%EX1
IDENT SG →→ P SG ¬%P2
NUMBR SG →→ P SG ¬%P2 #%ERT
%P2: T @TD P SG →→ T SG EXEC @TD TIMDIV ¬%T2
P SG →→ T SG ¬%T2
%T2: T @TD SCAN 2 ¬%EX1
E @PM T SG →→ E SG EXEC @PM PLUSM ¬%E2
T SG →→ E SG ¬%E2
%E2: E @PM SCAN 2 ¬%EX1
( E ) →→ P SCAN ¬%P2
SIFC E THEN →→ SIFC EXEC BOOL SCAN 2 ¬%S1
LHS E SG →→ S SG EXEC STORE ¬%S9
%S9: BEGIN S ; →→ BEGIN SCAN 2 ¬%S1
BEGIN S END →→ S SCAN ¬%S9
SIFC S ELSE EXEC IF1 SCAN 2 ¬%S1
SIFC S SG →→ S SG EXEC IF2 ¬%S9
SIFC S ELSE S SG →→ S SG EXEC IF3 ¬%S9
BLAT S ; →→ EXEC DONES ¬%S1
<END>
VI. Sample exec routine file:
ENTRY DONES;
BEGIN "EXECS"
DEFINE EXEC(X)="INTERNAL PROCEDURE X;
OUT(1,"" X ""&('15&'12));";
EXEC (BEG1)
EXEC (DECL)
EXEC (ENTID)
EXEC (ENDD)
EXEC (TWID)
EXEC (IFBEG)
EXEC (TRA)
EXEC (ENTLAB)
EXEC (ERRTR)
EXEC (TIMDIV)
EXEC (PLUSM)
EXEC (BOOL)
EXEC (STORE)
EXEC (IF1)
EXEC (IF2)
EXEC (IF3)
EXEC (DONES)
END